Abstract: Low-light images suffer from severe noise and low illumination. Current deep learning models that are trained with
real-world images have excellent noise reduction, but a ratio parameter must be chosen manually to complete the enhancement 
pipeline. In this work, we propose an adaptive low-light raw image enhancement network to avoid parameter-handcrafting and 
to improve image quality. The proposed method can be divided into two sub-models: Brightness Prediction (BP) and Exposure 
Shifting (ES). The former is designed to control the brightness of the resulting image by estimating a guideline exposure time $t_1$.
The latter learns to approximate an exposure-shifting operator $ES$, converting a low-light image with real exposure time $t_0$ to a
noise-free image with guideline exposure time $t_1$. Additionally, structural similarity (SSIM) loss and Image Enhancement Vector
(IEV) are introduced to promote image quality, and a new Campus Image Dataset (CID) is proposed to overcome the limitations 
of the existing datasets and to supervise the training of the proposed model. In quantitative tests, it is shown that the proposed 
method has the lowest Noise Level Estimation (NLE) score compared with BM3D-based low-light algorithms, suggesting a superior 
denoising performance. Furthermore, those tests illustrate that the proposed method is able to adaptively control the global image 
brightness according to the content of the image scene. Lastly, the potential application in video processing is briefly discussed. 


1 Introduction 


Images play an irreplaceable role in industry, the military and entertainment.
Because photography depends on light, how to obtain better images in low-light
environments has been studied extensively in the literature. Advanced
photography equipment can help, but at a high cost. Post-processing
low-light images, on the other hand, is subject to few such limitations.
However, it is challenging because of problems such as noise and
color distortion.

Exposure time, ISO (which measures the sensitivity of the image
sensor) and aperture are known as the three pillars of photography. With
an extended exposure time, the low-light problem can be easily solved,
but this is unrealistic when the camera is not fixed or the scene
contains moving objects. A higher ISO introduces more noise and is
not always available on mobile devices. As a flexible solution, low-light
image enhancement provides an alternative for imaging in
low-light environments.

In general, current low-light image enhancement methods can be
classified as follows:

Classic low-light image enhancement methods, which do not use
convolutional neural networks. Histogram equalization
(HE)[1] and gamma correction[2] provide simple solutions for low-light
image processing. Based on Retinex theory[3], Single-scale
Retinex (SSR)[4], Multi-scale Retinex (MSR)[5], SRIE[6] and
LIME[7] were proposed, enhancing low-light images by estimating
the illumination map and the reflectance map. Dehazing[8] methods can
also be applied to this problem by treating low-light images as inverted
hazy images.

With the rapid development of deep neural networks,
researchers have combined classic low-light image enhancement
methods with Convolutional Neural Networks (CNNs). Chen et
al. proposed Retinex-Net[9], which decomposes a low-light image into
illumination and reflectance and uses BM3D[10] for denoising.

Those methods assume that low-light images are free of noise or that noise
has already been removed by an additional denoising process. Therefore, when
applying those methods to real-world low-light images, an additional
denoising process is required. However, most low-light images
suffer from severe noise, and traditional denoising methods like BM3D
are not able to provide satisfactory results. To get rid of image noise,
researchers have also explored deep learning methods[11-14] for
denoising.

Fig. 1: Several scenes in the CID dataset.


Some low-light images are barely visible before
post-processing because they are captured in extremely dark environments
or with a very short exposure time. In this work, those images
are referred to as extreme low-light images. While the classic methods
are unable to tackle the severe noise and serious color distortion in
those images, it has been shown that with the help of deep learning
and raw-format image datasets, satisfactory results can still be
obtained. Chen et al. proposed an end-to-end solution for raw low-light
image enhancement[15], dealing with the low-light problem
and denoising in a single model. However, an extra brightness amplification
ratio is needed, which requires manual tuning for
different scenes. The method thus gains excellent denoising but loses the
flexibility and adaptability of classic methods.

Currently, all datasets available for low-light enhancement models
use a longer-exposed image as the ground-truth image in training.
The choice of this image depends heavily on the dataset collector's experience
in selecting exposure parameters such as exposure time, ISO and white
balance mode. These parameters are therefore usually not optimal with respect
to the content of the scene, the illumination of the environment and so on,
which can lead to an undesirable trained model.

It is hard for deep CNNs to learn how we determine the best
exposure parameters for the ground-truth images because this choice can
be very subjective and lacks specific standards. It can lead to more
problems if there is more than one dataset collector. On the other
hand, establishing a baseline for ground-truth selection is equally difficult.
Nevertheless, it is possible to predict how an image varies
when exposure parameters change. We name this process Exposure
Shifting (ES). With a learned ES model, it becomes possible to
work backwards and study what good exposure parameters (e.g. exposure time)
should be for each low-light image, utilizing the characteristics of
the back-propagation algorithm.

Fig. 2: The proposed framework. For an underexposed low-light raw image with guideline exposure time $t_1$ already known, the Exposure
Shifting Network (ESN) is applied to complete the estimation of the RGB result image. In a more general case, the Brightness Prediction Network
(BPN) is utilized to give a proper estimation for the uncertain parameter $t_1$ with the raw image array and available Exif metadata.

The contributions of our work are summarized as follows: 

1) We present the Campus Image Dataset (CID), which contains
a variety of scenes captured at multiple exposure levels. CID provides
powerful support for the training of our data-driven low-light
enhancement model.

2) We propose an adaptive two-stage low-light image enhancement
model, which provides state-of-the-art low-light noise suppression
as well as adaptive global brightness control:

In stage one, a Brightness Prediction Network (BPN) was intro- 
duced to estimate a proper exposure time based on the content of 
images as well as Exif (Exchangeable image file) metadata. BPN 
was trained to provide adaptive image brightness control during 
image enhancement and to get rid of the external parameters required 
in the Exposure Shifting stage. 

In stage two, an Exposure Shifting Network (ESN) was proposed 
to estimate a longer-exposed version of the image with target expo- 
sure time estimated by BPN. Moreover, ESN also completes image 
denoising and ends up with the final resulting RGB image. 

The rest of the article is organized as follows. The Related
Works section reviews previous research. The Dataset section
demonstrates the importance and advantages of our Campus
Image Dataset (CID). In the Method section, our model and
its training details are provided, as well as the design concept of
the model. In the Experiments section, the proposed model is compared with
other state-of-the-art methods from multiple perspectives,
including denoising, adaptive brightness control performance and
computational cost. The possible extension to video processing is
also briefly discussed there. In the Conclusion and Discussion section, the
advantages and limitations of our model are summarized, and
directions for future work are presented.


2 Related Works 


The method we propose was inspired not only by the application
of deep neural networks, but also by classic methods based on the
Histogram Equalization model, dehazing model and Retinex model.


2.1 Histogram Equalization Methods 


The Histogram Equalization (HE) method is used extensively to process
digital images. The basic idea is to improve image visibility
by changing the image histogram into a uniform distribution, which
effectively increases image entropy for low-light images. In [16]
it is argued that a color-invariant representation can be obtained
by applying HE. However, HE tends to cause noise amplification,
loss of detail and color distortion in low-light image
enhancement.

To address the drawbacks of HE, a number of improvements
have been put forward. Adaptive Histogram Equalization
(AHE) [17] and its variant Contrast-limited Adaptive Histogram
Equalization (CLAHE) [18] perform the histogram equalization in
a region surrounding each pixel, instead of across the whole image.
Hue-preserving color image enhancement [19] enhances contrast
while preserving hue, and [20] and [21] focus on preserving the
original image luminance.


2.2 Dehazing and Retinex Model-Based Methods 


A number of low-light enhancement methods are based on the
dehazing and Retinex models. It is observed that low-light images and
inverted hazy images share many similarities. For this reason, [8]
presented a method in which the low-light image is inverted, dehazed
and then inverted back again. This method was studied further by
[22] and [23].

On the other hand, the Retinex model [3] revealed that a natural
image is made up of the illumination map and the reflectance
map. With Retinex decomposition and the assumption that the illumination
map is smooth, researchers have proposed single-scale
Retinex [24] and multi-scale Retinex [4]. Based on image fusion,
adjustments can be applied to the illumination map to improve the
performance [25]. [26] focused on preserving natural characteristics
with a Bright-Pass filter. In [27], a variational Retinex model was
proposed and the illumination estimation problem was formulated
as a Quadratic Programming optimization problem. In [28] the illumination
map is estimated considering local structure, and [29] focused
on revealing the structure details from the reflectance map. A recent
study shows that the Retinex model can be combined with the atmospheric
scattering model[30]. More refined models were presented
by [31, 32] based on variational Retinex theory.

However, noise modeling is not well established in those algorithms.
In [8] it is assumed that noise is insignificant and can be
removed before or after the dehazing stage, and the Retinex-based
models rely on classic noise reduction algorithms (e.g. BM3D) to
reduce noise found in the reflectance map. These methods can therefore achieve
good results on mild low-light images but tend to amplify noise
when processing severe or extreme low-light images. Besides, the
variational Retinex model is very time-consuming, as it takes many
iterations to solve the optimization objectives, and so are many
denoising algorithms. It is thus hard to achieve real-time processing.


2.3 Data-driven Methods 


Image denoising and low-light enhancement can be realized with the
data-driven approach, separately or jointly.

Noise reduction has been the subject of extensive studies. [11]
showed that convolutional neural networks can compete with the
state-of-the-art classic denoising algorithm BM3D. The work on
data-driven denoising has been extended by [14, 33, 34]. Although
these works were not targeted at low-light image processing, they
provide useful references for other image processing
studies.

The first application of the data-driven method to low-light enhancement is LLNET [12],
which utilizes a deep auto-encoder to learn from synthetic low-light
image datasets, with the denoising objective integrated into the low-light
enhancement process. It was also demonstrated that Retinex-based
methods can be further improved by deep learning [9, 35], but
those works still have undesirable denoising performance because
they use synthetic datasets, which cannot reflect the characteristics
of real low-light images.

To overcome the drawbacks of synthetic datasets, a large multi-exposure
image dataset was proposed in [36], but it was found by
[15] that models that utilize raw-format images can provide much
better results in low-light enhancement. In [15], it was suggested that
state-of-the-art results can be achieved by using raw-format paired
images and training in an end-to-end way. However, since the two paired
images have different exposure times, a ratio was introduced as an
additional input to bridge the gap. Thus, after the model is trained and
with no reference to indicate the ratio, [15] requires manual adjustment
for every image to be enhanced, which is a significant
drawback in practical applications.


3 Dataset 


For extreme low-light image enhancement, few datasets are available.
The Google HDR+ dataset[37] and Darmstadt Noise
Dataset[38] avoid the disadvantages of synthetic datasets, but they
were mainly collected in environments with sufficient illumination.
The RENOIR[39] and Learning-to-See-in-the-Dark (SID)[15] datasets
provide real low-light image pairs, which were captured by
carefully selecting the ISO and exposure time for each scene. The
Exclusively Dark Dataset (EDD) [40] is another low-light dataset
with object-level annotations, which targets object recognition.
However, these datasets cannot satisfy the needs of our study. In
our work, in order to adaptively enhance images with different
noise levels, it is necessary to consider how the exposure parameters
affect low-light images. More specifically, as the environment
illumination decreases, low-light images are gradually
overwhelmed by noise and invalid pixels. Although it is not easy
to manipulate the environment illumination, the camera exposure
time setting can be easily changed to simulate this illumination shift.
Therefore, the desired dataset should contain, instead of
just image pairs, a number of multi-exposure image series, each of
which includes low-light images of different levels and one
reference image captured in the same scene.

Overall, three conditions need to be met for a satisfactory
dataset:

• Real scenes. Unlike synthetic images, photos captured in real
scenes reflect much more diverse noise and distortion patterns.

• Multiple exposure levels. A number of multi-exposure image
series are needed to train an adaptive model capable of enhancing low-light
images of different levels.

• Raw-format images. Most data captured by the camera sensor
is lost in format conversion, and this lost information is essential for
extreme low-light image enhancement.

Fig. 3: Demonstration of a CID image group.


Based on the above requirements, we propose our Campus Image
Dataset (CID). There are about 200 image groups in CID, each of
which consists of 8 raw images with different exposure times, shot
consecutively in the same scene using a tripod (Fig. 3). Specifically,
for every group, an upper limit of exposure time was selected
to make sure that the first image in the sequence was slightly overexposed,
then a lower limit was chosen to capture an extreme low-light
image as the last image in the sequence, and the gap between the limits
was filled by another 6 images. Finally, the ground-truth image
for each group was manually selected, and invalid images that contained
nothing but noise due to extremely short exposure were tagged and
excluded. A demonstration of scenes in CID is presented in Fig. 1.


4 Method 


4.1 Method Overview 


Most of the existing methods are unable to suppress the severe
noise present in extreme low-light images. On the other hand,
recent studies, e.g. [15], have concentrated on extreme low-light
enhancement with raw-format images and real-scene datasets in an
end-to-end way. Unlike the classic methods, however, the brightness of the
enhanced image cannot be automatically controlled in this framework.
More specifically, an external ratio has to be manually adjusted
when the input image has no paired image as the reference.

It is our aim to find a solution to enhance extreme low-light
images that combines the advantages of adaptive brightness control and
superior denoising performance. The problem of low-light
enhancement can be considered as a special case of changing
the exposure time. With $t_0$ and $t_1$ representing exposure times, the
low-light image $X_{t_0}$ can be interpreted as a normal image $Y_{t_1}$ corrupted
by a hypothetical exposure-shifting operator $ES$, which only
alters the exposure time but keeps the aperture and ISO unchanged:

$$X_{t_0} = ES(Y_{t_1}, t_0) \tag{1}$$

In this process, more stochastic noise $N$ is generated by the operator $ES$
because $t_0 < t_1$, corresponding to the low PSNR (Peak Signal to
Noise Ratio) value in low-light images. Conversely, the corrupted
low-light image can also be restored by the same operator $ES$:

$$Y_{t_1} = ES(X_{t_0}, t_1) \tag{2}$$

The noise $N$ is reduced because $t_1 > t_0$. Thus the normal image
$Y_{t_1}$ is reconstructed.

Fig. 4: The components of BPN.

The latter process, named Exposure Shifting (ES), can enhance
low-light images as well as suppress image noise. More importantly,
it can be learned in a data-driven and end-to-end way, thus it is
expected to have superior denoising performance. However, a proper
guideline exposure time $t_1$ always has to be provided for the reconstruction
of $Y_{t_1}$. To make our model adaptive, a Brightness Prediction
(BP) procedure is introduced to estimate the guideline time.

In brief, the extreme low-light enhancement process is split into
two procedures. Firstly, for a raw-format low-light image (with
real exposure time $t_0$), an optimal guideline exposure time $t_1$ is
estimated via a sub-model called the Brightness Prediction Network
(BPN). Secondly, the final RGB-format enhanced image is obtained
by another sub-model, namely the Exposure Shifting Network (ESN),
using the estimated guideline time $t_1$ to control the brightness of the result.

In addition, besides the image itself, some extra data are involved
in both BPN and ESN, such as white balance indices, ISO, $t_0$ and
$t_1$. The Image Enhancing Vector (IEV) is put forward to encode
this extra data.


4.2 Image Enhancing Vector


The raw image array is the original data captured by the camera. Other
key parameters, such as camera white balance, ISO, exposure time,
time and date, are stored together with the raw image array as Exif
metadata. In HDR imaging research[41], exposure time and ISO
information play an essential role in calculating the response
curve and in LDR-to-HDR conversion. Although it is not necessary
to explicitly train the network to learn the camera response curve,
feeding these additional parameters into the network as input avoids
underfitting caused by insufficient features.

The Image Enhancing Vector (IEV) is proposed to introduce extra
features into the convolutional neural networks. In this paper, the IEV
consists of the ISO value $u_s$, white balance indices for the 4 raw image channels
$(w_r, w_g, w_b, w_{g2})$, the exposure time of the low-light image $t_0$
and the guideline exposure time $t_1$. Moreover, it is possible to append
other Exif metadata to the IEV to improve network performance when
necessary, e.g. aperture and focal length.

The IEV is used as network input in both BPN (for $t_1$ estimation)
and ESN (for exposure shifting), but a difference exists. As
shown in Fig. 2, in the ES procedure the IEV can be written as:

$$V(t_0, t_1) = (w_r, w_g, w_b, w_{g2}, u_s, \ldots, t_0, t_1) \tag{3}$$

where $(w_r, w_g, w_b, w_{g2})$ are the white balance indices of the 4 raw
image channels and $u_s$ is the ISO value (controlling the camera
sensitivity).

In the BP procedure the $t_1$ in the IEV is removed. The result is named the partial
IEV (pIEV) to indicate the difference:

$$V_p(t_0) = (w_r, w_g, w_b, w_{g2}, u_s, \ldots, t_0) \tag{4}$$


IEV and pIEV are fed into the convolutional neural network as
additional channels of the raw image array. For example, the first
extra channel provided by the IEV is a uniform image filled with the
pixel value $w_r$.
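
As an illustration of how such constant channels could be formed, the following is a minimal NumPy sketch. It assumes the raw image is packed as a (C, H, W) array with C = 4 Bayer channels; the function name `append_iev_channels` and the numeric values are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def append_iev_channels(raw, iev):
    """Stack one constant-valued channel per IEV element onto a raw array.

    raw: float array of shape (C, H, W), e.g. C = 4 packed Bayer channels
         (this packing is an assumption of the sketch).
    iev: 1-D sequence of scalars, e.g. (w_r, w_g, w_b, w_g2, u_s, t0, t1).
    """
    raw = np.asarray(raw, dtype=np.float32)
    iev = np.asarray(iev, dtype=np.float32)
    _, h, w = raw.shape
    # Each IEV element becomes a uniform H x W image.
    planes = np.broadcast_to(iev[:, None, None], (iev.size, h, w))
    return np.concatenate([raw, planes], axis=0)

raw = np.zeros((4, 8, 8), dtype=np.float32)
iev = [2.0, 1.0, 1.5, 1.0, 800.0, 0.1, 10.0]   # illustrative values only
x_es = append_iev_channels(raw, iev)            # full IEV, ES procedure
x_bp = append_iev_channels(raw, iev[:-1])       # pIEV (no t1), BP procedure
```

With 4 raw channels and the 7-element IEV above, the ES input has 11 channels and the BP input has 10.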


4.3 Exposure Shifting 


Exposure Shifting (ES) is the second procedure in our model but
is trained first. The U-NET[42] is selected as the structure of the
Exposure Shifting Network (ESN). Compared with the fully convolutional
network, the U-NET architecture reduces the usage of GPU
memory, so it can easily process high-resolution images.

Recall that the basic unit of our Campus Image Dataset (CID) is
an image group, which consists of 8 raw-format images with different
exposure times but captured in the same scene. Let $(X_R, Y_R)$ be
the paired raw images extracted from one group, in which $Y_R$ is the
pre-selected ground-truth image and $X_R$ is a randomly selected low-light
image. More specifically, $X_R$ is randomly chosen within this
image group, with $Y_R$ and over-exposed images excluded (Fig. 3).
Moreover, the exposure time of $X_R$ is denoted as $t_0$. The corresponding
RGB-format image of $Y_R$ is denoted as $Y$ and its exposure
time as $t_g$.
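
The pair selection described above can be sketched as follows. This is only an illustration: the function `sample_pair`, the example exposure times, and the rule that any image exposed longer than the ground truth counts as over-exposed are assumptions made for the sketch, not details given by the dataset description.

```python
import random

def sample_pair(exposure_times, gt_index, invalid=(), rng=random):
    """Pick (low-light index, ground-truth index) from one image group.

    exposure_times: exposure time of each image in the group, in seconds.
    gt_index: index of the manually selected ground-truth image.
    invalid: indices tagged as noise-only and excluded from training.
    Images exposed longer than the ground truth are treated as over-exposed
    (an assumption of this sketch).
    """
    t_g = exposure_times[gt_index]
    candidates = [
        i for i, t in enumerate(exposure_times)
        if i != gt_index and i not in invalid and t < t_g
    ]
    if not candidates:
        raise ValueError("group has no usable low-light image")
    return rng.choice(candidates), gt_index

# Hypothetical group: index 0 is slightly over-exposed, index 7 is noise-only.
times = [4.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.05, 0.01]
x_idx, y_idx = sample_pair(times, gt_index=1, invalid={7}, rng=random.Random(0))
t0, tg = times[x_idx], times[y_idx]
```

Drawing a new low-light index each time a group is visited lets a single group supply several $(X_R, Y_R)$ pairs with different $t_0$.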

The ESN is expected to learn the exposure-shifting operator
$ES$ from a large number of paired images. Specifically, the ESN is
required to estimate an image $\hat{Y}^{\Theta_1}$ that resembles $Y$ as closely as
possible. In this way the ESN changes the raw-format low-light
image $X_R$ into an RGB-format longer-exposed version of itself.
More importantly, the serious noise of $X_R$ can also be removed. This
procedure can be written as:

$$\hat{Y}^{\Theta_1} = F_{ES}\left(X_R, V(t_0, t_1 = t_g) \mid \Theta_1\right) \tag{5}$$

where the ESN is denoted as $F_{ES}$ and its parameters as $\Theta_1$. Note that since
the BP procedure is not yet involved, the guideline exposure time
$t_1$ in the IEV is replaced by $t_g$. The enhanced result image is
represented by $\hat{Y}^{\Theta_1}$.

Two metrics guide the training process of the ESN model:
Mean Absolute Error (MAE) and multi-scale Structural Similarity
(SSIM)[43]. The former is a simple but effective
indicator, evaluating the general similarity of the estimated image
and the ground-truth image. The latter is a perceptual metric that
is more sensitive to visible structures. Both are full-reference
metrics. Let $L_{MAE}$ be the MAE loss and $L_{SSIM}$ be the SSIM loss.
The width and length of the image are denoted as $m$ and $n$ respectively.
With the subscript $t_g$ omitted, the MAE loss $L_{MAE}$ and SSIM
loss $L_{SSIM}$ are defined as:


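$$L_{MAE}(\hat{Y}^{\Theta_1}, Y) = \mathrm{MAE}(\hat{Y}^{\Theta_1}, Y) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left| \hat{Y}^{\Theta_1}_{ij} - Y_{ij} \right|$$

$$L_{SSIM}(\hat{Y}^{\Theta_1}, Y) = 1 - \mathrm{SSIM}(\hat{Y}^{\Theta_1}, Y) \tag{6}$$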

The final MAE loss is the average of the per-channel MAE losses; for
simplicity, the image channel is not shown in the equation. The
calculation of the multi-scale structural similarity (SSIM) can be
found in [43]. With the subscript $t_g$ omitted, the loss function is defined
as the linear combination of $L_{MAE}$ and $L_{SSIM}$:


$$L_{ES} = L_{ES}(\hat{Y}^{\Theta_1}, Y) = (1 - \alpha) L_{MAE}(\hat{Y}^{\Theta_1}, Y) + \alpha L_{SSIM}(\hat{Y}^{\Theta_1}, Y) \tag{7}$$


where $\alpha$ is a constant and $0 < \alpha < 1$. The influence of $\alpha$ will be
discussed in the Experiments section.
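
To make the weighting in Equation 7 concrete, the following is a minimal NumPy sketch. Note that `global_ssim` is a simplified stand-in that computes SSIM from whole-image statistics; it is not the multi-scale, windowed SSIM of [43]. The value `alpha=0.5` and all function names are assumptions chosen only for illustration.

```python
import numpy as np

def mae_loss(y_hat, y):
    # Mean absolute error over all pixels and channels.
    return float(np.mean(np.abs(y_hat - y)))

def global_ssim(y_hat, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified stand-in: SSIM from whole-image statistics, images in [0, 1].
    # NOT the multi-scale, windowed SSIM of [43].
    mu_x, mu_y = y_hat.mean(), y.mean()
    var_x, var_y = y_hat.var(), y.var()
    cov = ((y_hat - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

def es_loss(y_hat, y, alpha=0.5):
    # Equation 7: (1 - alpha) * L_MAE + alpha * L_SSIM, with L_SSIM = 1 - SSIM.
    return (1 - alpha) * mae_loss(y_hat, y) + alpha * (1 - global_ssim(y_hat, y))

rng = np.random.default_rng(0)
y = rng.random((3, 16, 16))
noisy = np.clip(y + 0.2 * rng.standard_normal(y.shape), 0.0, 1.0)
same = es_loss(y, y)        # identical images: both terms vanish
worse = es_loss(noisy, y)   # noisy estimate: strictly larger loss
```

Setting `alpha` closer to 1 shifts the emphasis from pixel-wise error toward structural similarity.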

The parameters of the ESN network $\Theta_1$ are learned by minimizing
$L_{ES}$ over $K$ pairs of images in the training set:

$$\Theta_1^* = \arg\min_{\Theta_1} \sum_{k=1}^{K} L_{ES}\left(\hat{Y}_k^{\Theta_1}, Y_k\right) \tag{8}$$

$\Theta_1^*$ represents the optimized ESN parameters.

Fig. 5: The results of using our method to enhance multiple images of the same scene captured with diverse exposure times. For the second
image group, the exposure time $t_0$ of the last image in the sequence is only 8% of that of the first, yet the network compensates for this
difference, keeping the brightness of each image in the scene approximately uniform.

The obtained sub-model $F_{ES}$ is an approximation to the
exposure-shifting operator $ES$. However, for a low-light image outside
the training set, a ground-truth counterpart does not exist,
so the guideline time $t_1$ in the IEV is absent. This motivates
the Brightness Prediction (BP) sub-model and guideline time
estimation.


4.4 Brightness Prediction 


The Brightness Prediction (BP) procedure provides the estimation of
the guideline exposure time $t_1$:

$$\hat{t}_1^{\Theta_2} = F_{BP}\left(X_R, V_p(t_0) \mid \Theta_2\right) \tag{9}$$

where $F_{BP}$ and $\Theta_2$ are the BPN and its parameters respectively, and
$V_p(t_0)$ is the pIEV. BP is the first enhancement
procedure in our model but is trained after the ESN. As illustrated
in Fig. 4, the Brightness Prediction Network (BPN) is a relatively
small network, which consists of multiple convolution layers
followed by several fully connected layers. Because of the
fully connected layers, the input
image $X_R$ must be resized to a fixed 512 x 512.

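With the estimate $\hat{t}_1^{\Theta_2}$ obtained in this BP procedure, the $t_g$ item
in the IEV in the ES procedure can be replaced by $\hat{t}_1^{\Theta_2}$. Different from
$\hat{Y}^{\Theta_1}$ derived in (5), the new final result image can be formulated as:

$$\hat{Y}^{\Theta_2|\Theta_1^*} = F_{ES}\left(X_R, V(t_0, \hat{t}_1^{\Theta_2}) \mid \Theta_1^*\right) \tag{10}$$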


An approach to train the BPN is then required. To begin with, the
purpose of the BPN is to control the brightness of the final result image,
which is achieved by adjusting the guideline exposure time $t_1$ in the
ES procedure. Since image pairs $(X_R, Y_R)$ and their exposure
times $(t_0, t_g)$ are available, intuitively it seems that the BPN could
simply be learned from $(X_R, t_g)$ in an end-to-end way, with $X_R$ as the
input and $t_g$ as the label. However, this is not feasible, for reasons
discussed in the next subsection.

The proposed BPN training approach and the loss function $L_{BP}$
are as follows. We define the brightness of a single pixel $Br$ as the
average pixel value of all color channels. Given a certain dark scene
for image capturing, with ISO, aperture settings and the scene all
fixed, the brightness of a pixel is determined only by the exposure
time. Let $R$ be the pixel irradiance in this scene. When the exposure
time $t$ in the camera settings increases, the pixel exposure $E(R, t)$
increases as well [44]:

$$E(R, t) = Rt \tag{11}$$

Thus the pixel becomes brighter. This suggests
designing the loss function $L_{BP}$ based on pixel brightness.

Fig. 6: The results using different methods on CID test images. From (a) to (g), they are the original low-light images and the results of
multiscale Retinex (MSR), MSR with BM3D, illumination map estimation based (LIME), LIME with BM3D, deep Retinex decomposition
(R-net), R-net with BM3D, learning-to-see-in-the-dark (SID) model and ours respectively.

It should be noted that the relationship between pixel brightness
$Br(E)$ and exposure $E = Rt$ is not linear. Instead, it can typically
be described by an S-shaped curve, because the pixel brightness
is limited within [0, 1] (or [0, 255] for 8-bit image storage) and the
camera sensor is less sensitive to the change of $t$ when the pixel
value is close to 0 or 1. In other words, as the exposure time $t$
setting changes, some pixels are insensitive and show little
change in brightness compared with other pixels.
For example, when photographing a street at night, pixels showing
the lamp bulbs are always saturated unless $t$ is extremely small,
and pixels representing the dark sky are always close to zero unless
they are affected by noise. If those pixels are involved in BPN training,
they can cause vanishing gradients in the back-propagation algorithm,
because $\frac{\partial Br(E)}{\partial t} = \frac{\partial Br(E)}{\partial E} \cdot R$ and, as revealed by the S-shaped
curve, $\frac{\partial Br(E)}{\partial E}$ is close to 0. On the other hand, when the pixel brightness
$Br$ is around 0.5, $\frac{\partial Br(E)}{\partial E}$ reaches its maximum. The pixels satisfying
this condition can be collected into an Area of Interest (AoI),
which identifies pixels that are sensitive to the process of Exposure
Shifting.
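
The sensitivity argument can be checked numerically with a toy response curve. The sketch below assumes a logistic curve as the S-shaped response; the real camera response is not specified here, and the `gain` and `mid` parameters as well as the three irradiance values are purely illustrative.

```python
import numpy as np

def brightness(exposure, gain=8.0, mid=0.5):
    # Hypothetical S-shaped response: logistic curve in exposure E = R * t.
    return 1.0 / (1.0 + np.exp(-gain * (exposure - mid)))

def brightness_slope(exposure, gain=8.0, mid=0.5):
    # Analytic derivative dBr/dE of the logistic curve.
    br = brightness(exposure, gain, mid)
    return gain * br * (1.0 - br)

irradiance = np.array([0.02, 0.5, 3.0])   # dark sky, mid-tone, lamp bulb
t = 1.0                                    # exposure time
exposure = irradiance * t                  # Equation 11
br = brightness(exposure)
# Chain rule: dBr/dt = dBr/dE * R
dbr_dt = brightness_slope(exposure) * irradiance
```

Under this assumed curve, the mid-tone pixel ($Br = 0.5$) has by far the largest gradient with respect to $t$, while the near-black and saturated pixels contribute almost nothing.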

First, from the ground-truth image $Y$, an AoI weight map $W$ can
be obtained by selecting pixels with brightness near 0.5. More
specifically, the ground-truth image $Y$ is converted into a single-channel
grayscale image $Y_G$, then a Gaussian curve with mean
$\mu_w = 0.5$ and variance $\sigma_w^2 = 0.01$ is applied to each pixel of $Y_G$:

$$W_G = \exp\left(-\frac{(Y_G - \mu_w)^2}{2\sigma_w^2}\right) \tag{12}$$

Then $W_G$ is normalized, and the result is the AoI weight map $W$. The
value at position $(i, j)$ is denoted by subscripts, namely $W_{G,ij}$ and
$W_{ij}$:

$$W_{ij} = \frac{W_{G,ij}}{\sum_{i'=1}^{m} \sum_{j'=1}^{n} W_{G,i'j'}}, \quad \forall\, 1 \le i \le m,\ 1 \le j \le n \tag{13}$$

Equations 12 and 13 can be combined and written compactly with the
softmax function:

$$W = \mathrm{softmax}\left(-\frac{(Y_G - \mu_w)^2}{2\sigma_w^2}\right) \tag{14}$$
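
A minimal NumPy sketch of the AoI weight map follows, computed both as the normalized Gaussian of Equations 12 and 13 and as the softmax of Equation 14. The function names and the toy 2 x 2 grayscale image are assumptions for illustration only.

```python
import numpy as np

def aoi_weight_map(y_gray, mu_w=0.5, var_w=0.01):
    """Equations 12 and 13: Gaussian response around mu_w, normalized to sum to 1."""
    w_g = np.exp(-((y_gray - mu_w) ** 2) / (2.0 * var_w))
    return w_g / w_g.sum()

def aoi_weight_map_softmax(y_gray, mu_w=0.5, var_w=0.01):
    """Equation 14: the same map as a softmax over all pixels."""
    logits = -((y_gray - mu_w) ** 2) / (2.0 * var_w)
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()

# Toy grayscale ground truth with values in [0, 1] (illustrative only).
y_gray = np.array([[0.02, 0.45],
                   [0.55, 0.98]])
w = aoi_weight_map(y_gray)
w_soft = aoi_weight_map_softmax(y_gray)
```

The two mid-tone pixels receive nearly all of the weight, while the near-black and near-white pixels are effectively ignored; both formulations give the same map.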


Lastly, the loss function $L_{BP}$ is designed to check whether the final result
image $\hat{Y}^{\Theta_2|\Theta_1^*}$ has the same AoI as the ground truth. Having the same AoI means that
in the AoI area $\hat{Y}^{\Theta_2|\Theta_1^*}$ and $Y$ share similar brightness. Moreover,
recall that pixels outside the AoI are not sensitive to changes in the exposure
time $t$ or to the ES process, so it follows that the overall
image brightness of $\hat{Y}^{\Theta_2|\Theta_1^*}$ resembles that of $Y$. In this way, the BPN
achieves the adaptive Brightness Prediction objective.

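The enhanced image $\hat{Y}^{\Theta_2|\Theta_1^*}$ is then converted into a grayscale
image $\hat{Y}_G^{\Theta_2|\Theta_1^*}$. When the AoI of $\hat{Y}_G^{\Theta_2|\Theta_1^*}$ and that of $Y_G$ overlap exactly,
the following loss function reaches its minimum:

$$L_{BP} = L_{BP}\left(\hat{Y}_G^{\Theta_2|\Theta_1^*}, W \mid \Theta_2; \Theta_1^*\right) = -\sum_{i=1}^{m} \sum_{j=1}^{n} W_{ij} \exp\left(-\frac{\left(\hat{Y}_{G,ij}^{\Theta_2|\Theta_1^*} - \mu_w\right)^2}{2\sigma^2}\right) \tag{15}$$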

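where $\sigma^2 = 0.04$ is another variance constant. The image value
at position $(i, j)$ is denoted by the subscript $ij$. Finally, the BPN is
trained by optimizing $\Theta_2$ over $K$ pairs of images in the training set:

$$\Theta_2^* = \arg\min_{\Theta_2} \sum_{k=1}^{K} L_{BP}\left(\hat{Y}_{G,k}^{\Theta_2|\Theta_1^*}, W_k \mid \Theta_2; \Theta_1^*\right) \tag{16}$$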


4.5 Method Summary and Discussion


In Algorithm 1, the training process of our model is summarized, 
together with the method to evaluate and apply this model. 


Algorithm 1 Model training and evaluation

Initialize ESN and BPN parameters $\Theta_1$ and $\Theta_2$
Prepare CID dataset
function TRAIN(Dataset)
    for EpochTrainESN = 1 → $E_1$ do
        for Image Pair k = 1 → K do
            Read raw image pair $(X_{R,k}, Y_{R,k})$ and IEV $V_k(t_0, t_1 = t_g)$
            Convert to RGB format: $Y_k \leftarrow Y_{R,k}$
            Compute $\hat{Y}_k^{\Theta_1}$ via Equation 5
            $\Theta_1 \leftarrow \arg\min_{\Theta_1} \sum L_{ES}(\hat{Y}_k^{\Theta_1}, Y_k)$
        end for
    end for
    $\Theta_1^* \leftarrow \Theta_1$
    for EpochTrainBPN = 1 → $E_2$ do
        for Image Pair k = 1 → K do
            Read raw image pair $(X_{R,k}, Y_{R,k})$ and pIEV $V_{p,k}(t_0)$
            Compute $\hat{t}_{1,k}^{\Theta_2}$ and $\hat{Y}_k^{\Theta_2|\Theta_1^*}$ via Equations 9 and 10
            Convert to grayscale image: $Y_{G,k} \leftarrow Y_{R,k}$
            Convert to grayscale image: $\hat{Y}_{G,k}^{\Theta_2|\Theta_1^*} \leftarrow \hat{Y}_k^{\Theta_2|\Theta_1^*}$
            Compute $W_k$ via Equation 14
            $\Theta_2 \leftarrow \arg\min_{\Theta_2} \sum L_{BP}(\hat{Y}_{G,k}^{\Theta_2|\Theta_1^*}, W_k \mid \Theta_2; \Theta_1^*)$
        end for
    end for
    $\Theta_2^* \leftarrow \Theta_2$
    return $\Theta_1^*$, $\Theta_2^*$
end function

function EVALUATE(Image)
    Read raw-format image $X_R$
    Read all elements in pIEV $V_p(t_0)$ from raw-image file
    BPN: Compute $\hat{t}_1^{\Theta_2^*} \leftarrow F_{BP}(X_R, V_p(t_0) \mid \Theta_2^*)$
    ESN: Compute $\hat{Y}^{\Theta_2^*|\Theta_1^*} \leftarrow F_{ES}(X_R, V(t_0, \hat{t}_1^{\Theta_2^*}) \mid \Theta_1^*)$
    return $\hat{Y}^{\Theta_2^*|\Theta_1^*}$
end function

As mentioned in the previous subsection, training the BPN with input
$X_R$ and label $t_g$ is straightforward but not feasible. First let $F_{BP}^*$
be the hypothetical optimal BPN model, then $t_1^*(X_R) = F_{BP}^*(X_R)$
is the optimal estimation of the guideline exposure time. For a pair of
image and time label $(X_R, t_g)$, $t_g$ can be considered as one sample
from a Gaussian distribution: $t_g \sim \mathcal{N}(t_1^*, \sigma^2)$. The variance
$\sigma^2$ in this process is too large because:

• In every CID image group, the ground-truth image is chosen
manually and is dominated by subjective human judgment.

• The ground-truth image is selected from only 8 candidate images,
but the optimal $t_1^*$ exists somewhere in the time continuum $(0, \infty)$.

Furthermore, the number of samples is very limited since each image
group can only provide one $t_g$ sample. As a result, this straightforward
approach cannot be applied to the training of the BPN due to the
high variance of the label $t_g$ and the limited number of samples.

5 Experiments 

5.1 Training 


Table 1 Comparison results with no-reference image quality metrics. The MSR,
LIME and R-Net methods are combined with BM3D to make a fair comparison.
No-reference quality metrics, including Statistical Noise Level Estimation
(SNLE), Noise Level Estimation (NLE), Noise Variance (NV) and image entropy,
are applied to evaluate model denoising and brightness control performance.


Methods        SNLE      NLE       NV        Entropy
Original       -         -         -         4.0689
MSR+BM3D       0.0014    1.3792    0.8998    7.0068
LIME+BM3D      0.0086    4.2437    10.1789   7.1084
R-Net+BM3D     0.0017    0.2467    0.4491    6.1503
Ours           0.0015    0.1061    0.4554    6.6112


Limited by GPU memory, we did not adopt the Batch Normalization [45]
technique. To accelerate model training, before the training
process started, each channel of the input image, as well as each
element of IEV, was normalized along the sample dimension. The
means and variances used in the normalization were then kept as fixed
constants. Additionally, training images were randomly cropped
into 512 × 512 patches.
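
The preprocessing described above (fixed per-channel statistics and random 512 × 512 crops) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function names, the `(N, H, W, C)` array layout and the small `eps` guard are choices made for this example, not details of the implementation used in the experiments.

```python
import numpy as np

def fit_channel_stats(images):
    """Per-channel mean and std over a stack shaped (N, H, W, C).

    Computed once before training and then treated as fixed constants.
    """
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return mean, std

def normalize(image, mean, std, eps=1e-8):
    """Normalize an (..., C) array with precomputed per-channel statistics."""
    return (image - mean) / (std + eps)

def random_crop(image, size=512, rng=None):
    """Cut a random size x size patch from an (H, W, C) image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))   # upper bound is exclusive
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stack = rng.random((2, 520, 530, 4))       # synthetic stand-in data
    mean, std = fit_channel_stats(stack)
    patch = random_crop(normalize(stack[0], mean, std), 512, rng)
    print(patch.shape)
```

Because the statistics are fixed before training, the normalization adds no batch-dependent state, unlike Batch Normalization.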

Both ESN and BPN were trained with the Adam optimizer [46],
following the procedure shown in Algorithm 1. The ESN was trained
first, with the learning rate slowly descending from $2 \times 10^{-4}$ to
$1 \times 10^{-5}$. After about 300 epochs, when the loss on the training set
had become stable, the parameters of ESN were frozen.
BPN was then appended to the training graph and trained with the
same learning rate. Considering that BPN may require global information
to decide the proper exposure time, we used bigger patches but
avoided training with the whole image to prevent overfitting. Although
the loss must propagate through the deep ESN before reaching BPN,
the training process went smoothly and completed after 100 epochs.
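
For illustration, one way to realize a learning rate that descends slowly from $2 \times 10^{-4}$ to $1 \times 10^{-5}$ over about 300 epochs is a geometric (exponential) decay. The exact decay shape is not specified in the text, so the schedule below is an assumption, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(epoch, total_epochs=300, lr_start=2e-4, lr_end=1e-5):
    """Geometric interpolation from lr_start (epoch 0) to lr_end (last epoch).

    Assumed schedule: the text only states the two end points.
    """
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** frac

if __name__ == "__main__":
    for epoch in (0, 100, 200, 299):
        print(epoch, learning_rate(epoch))
```

A geometric decay lowers the rate by the same factor every epoch, which keeps the early epochs close to the initial rate while still reaching the final value by the end of training.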


5.2 Content-Based Brightness Control 


First, we highlight the adaptive brightness control feature of our
method. Here we explicitly define the brightness as the mean value of
an image $Y$, denoted as $Br(Y)$. We expect the image brightness to be
adjusted dynamically based on the image content. More specifically,
extreme low-light images suffer from severe noise along with
color distortion and are barely visible before post-processing, whereas
mild low-light images only have moderate noise and relatively lower
brightness compared with normal images. Thus, the enhancement
algorithm needs to adapt to different input images.
Currently, classic methods, such as algorithms based on
Retinex and dehazing, perform poorly on extreme low-light
images. Learning-based methods targeting end-to-end denoising,
on the other hand, can process all kinds of low-light images,
but they lack flexibility and adaptability because they either need
hand-tuned brightness amplification or simply serve as a denoising
procedure within other low-light methods. The proposed model
combines the advantages of both and is able to process low-light
images of different levels.

To evaluate this adaptability, we apply the trained model to
enhance all images in the test set. Recall that the image group is the
basic unit of our CID dataset. Accordingly, instead of evaluating single
images, we investigate the collaborative characteristics across different
images within each image group. Note that the group identity
of an image is not involved in model training and only serves
as a tag to gather the resulting images for further analysis.

For a sequence of multi-exposed images $(X_{R,1}, X_{R,2}, \ldots)$ with
varying exposure times, it is expected that after enhancement
the resulting images possess similar brightness, namely
$Br(\hat{Y}_1) \approx Br(\hat{Y}_2) \approx \ldots$, where $\hat{Y}$ denotes the enhanced image.
Fig. 5(a) illustrates a busy-road scene: the first row displays the RGB
images produced by the camera, and the second row displays the results
of applying our method to the individual raw images. Note that
the first image in this group is exposed for 1/20 second, causing
motion blur on the truck, while the last image is exposed for only
1/100 second. All four images have approximately
uniform brightness after enhancement, despite the changing $t_0$ and
the moving vehicle headlights. Fig. 5(b) presents the performance of
our method under extreme low-light conditions. The last image in the
sequence suffers from both low environmental illumination and short
exposure, and the color and shape of objects in the image are drowned
out by noise. Nevertheless, the resulting image has its noise reduced and
still maintains acceptable global brightness.

Fig. 7: Controlled experiments on the impact of SSIM loss.

Fig. 8: The shift of average brightness and brightness coefficient of
variation (CV) within each group. Each group contains 3 to 6 images
captured in a burst with different exposure times, e.g. Fig. 5. In both
the test set and the training set, an increase in brightness and a
reduction in CV are observed. Note that the CV decreased considerably,
indicating highly uniform brightness for images within each group after
enhancement.

Fig. 9: Comparison of the original exposure time $t_0$ and the
BPN-assigned time $t_1$. BPN infers $t_1$ based on not only the original
time $t_0$ but also the content of the scene. The color indicates the mean
global brightness value of the enhanced images within each group.

Fig. 10: Applying our method to dynamic low-light video enhancement.

As mentioned above, for quantitative analysis, we gathered
images that belong to the same group and calculated the brightness
of all images in order to observe the shift in brightness. Let
$(Br_1, Br_2, \ldots)_k$ be the brightness of each unenhanced low-light
image in the $k$-th image group and let $(Br'_1, Br'_2, \ldots)_k$ be that of
the enhanced ones. The mean value and the coefficient of variation (CV)
of $(Br_1, Br_2, \ldots)_k$ are then computed and denoted as $(\mu, c_v)_k$;
correspondingly, $(\mu', c'_v)_k$ is calculated from $(Br'_1, Br'_2, \ldots)_k$. In
Fig. 8, we plot $(\mu, c_v)$ and $(\mu', c'_v)$ for all groups. To illustrate
the change after enhancement, for each group $(\mu, c_v)$ (marked with
a star) and $(\mu', c'_v)$ (marked with a circle) are linked with a dotted line.
It can be observed from Fig. 8 that, for groups captured
from diverse scenes, the enhanced images all end up with
increased brightness and significantly reduced CV. Yet
the brightness growth can differ greatly between groups,
ranging from around 60% to over 1,000%. Considered together with
Fig.exp, Fig. 8 indicates that the BP sub-model is able to adaptively
control the global image brightness according to not only the brightness
of the original image but also the content of the scene. On the
other hand, the dramatic reduction of CV suggests that the enhanced
images within each group share almost identical brightness in spite
of the varying exposure time $t_0$. To be precise, our model keeps the
group brightness CV under 0.29 on 76.3% of test-set groups and 98.5% of
training-set groups, considering the influence of diverse scenes,
illumination conditions, capture ISO, etc.
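
The per-group statistics used in this analysis can be computed as in the following sketch. It assumes images stored as NumPy arrays with values in [0, 1] and uses the population standard deviation for the CV; the text does not state which estimator was used, and the function names are illustrative.

```python
import numpy as np

def brightness(image):
    """Br: the mean value of an image."""
    return float(np.mean(image))

def group_stats(images):
    """Return (mean brightness, coefficient of variation) for one group."""
    br = np.array([brightness(img) for img in images])
    mu = float(br.mean())
    cv = float(br.std() / mu)   # population std; an assumption of this sketch
    return mu, cv

if __name__ == "__main__":
    # Synthetic group: three exposures before and after an ideal enhancement.
    low = [np.full((2, 2), v) for v in (0.1, 0.2, 0.3)]
    enhanced = [np.full((2, 2), 0.4) for _ in range(3)]
    print(group_stats(low))        # brightness varies with exposure
    print(group_stats(enhanced))   # equal brightness gives a CV of zero
```

A CV close to zero after enhancement means that the images of a group end up with nearly the same brightness regardless of their original exposure time.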

When designing our model, we expected the BPN to extract features
of the scene content and then estimate a reasonable $t_1$ accordingly.
To see how this black box works, Fig. 9
illustrates the relationship between the real exposure time
$t_0$, the BPN-estimated guideline exposure time $t_1$ and the brightness
of the enhanced images $Br(\hat{Y})$. Similar to Fig. 8, we use group averages
to simplify the figure, in which each point corresponds to an image
group. As shown in Fig. 9, the network tends to estimate a $t_1$ that is
positively associated with $t_0$. Furthermore, for groups with identical
average $t_0$, the content of the scene has an impact on the $t_1$ prediction.
Specifically, if the scene of a group has more relatively bright areas,
a smaller $t_1$ is more likely to be adopted to prevent overexposure,
and vice versa. A few exceptions exist, however, suggesting that the BPN
sub-model has learned some more sophisticated rules inside the neural
network black box.


5.3 Denoising Evaluation 

We compare our method with four classic and state-of-the-art methods:
multiscale Retinex (MSR) [5], illumination map estimation
based (LIME) [7], deep Retinex decomposition (Retinex-Net,
R-Net) [9] and learning-to-see-in-the-dark (SID) [15]. For a fair
comparison, 3D transform-domain filtering (BM3D) [10] is applied to
MSR, LIME and R-Net, since those methods do not model noise and
require an extra denoising step. Additionally, we train the SID
model on the CID dataset and manually select the ratio for the SID
model, since it does not provide automatic brightness estimation.

Recall that the BPN uses a reduced-reference evaluation index
as the loss in its training. In this way, the exposure parameters of
the ground-truth image, which is chosen by highly subjective human
judgment, have much less influence on BPN model training. As
a result of this technique, however, PSNR and SSIM evaluation
cannot be adopted because there is no reference image. We therefore provide
perceptual comparisons in Fig. 6 and evaluate our method with
no-reference quality metrics.

In Table 1, we compare the proposed method with MSR, LIME
and R-Net (Retinex-Net), using several no-reference quality metrics
including Statistical Noise Level Estimation (SNLE) [47], Noise
Level Estimation (NLE) [48], Noise Variance (NV) [49] and image
entropy. The SNLE, NLE and NV metrics estimate the noise level from
a single image, and clean images have relatively smaller values than
noisy ones. The SNLE estimates noise variance by observing the
eigenvalues of images, and the NLE by applying principal component
analysis (PCA) to special image patches. The NV metric is
motivated by the fact that clean and noisy images have different
sensitivity to the Laplacian operation. The image entropy reflects the
average amount of information in an image; it is the simplest
image evaluation metric, and images with proper overall brightness
have larger image entropy. We run the evaluation on about 400 CID
test images. The SID method is excluded because it requires
an external, manually selected amplification ratio for every image.
As shown in Table 1, our method achieves the lowest NLE score
(0.1061). Moreover, the no-reference metrics use
very different standards to assess image quality. The MSR+BM3D,
R-Net+BM3D and LIME+BM3D methods all have undesirable scores
on one or more indices, while our method performs well
on all metrics.
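
Of the four metrics, image entropy is simple enough to sketch directly. The version below computes the Shannon entropy of the 8-bit grayscale histogram, which is a common definition; the text does not specify the exact variant used for Table 1, so the bit depth and the grayscale input are assumptions of this example.

```python
import numpy as np

def image_entropy(gray_u8):
    """Shannon entropy (in bits) of an 8-bit grayscale image histogram."""
    # Count how often each of the 256 intensity levels occurs.
    hist = np.bincount(gray_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                      # skip empty bins (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())

if __name__ == "__main__":
    flat = np.zeros((8, 8), dtype=np.uint8)                   # single level
    ramp = np.arange(256, dtype=np.uint8).reshape(16, 16)     # all levels once
    print(image_entropy(flat), image_entropy(ramp))
```

Under this definition the entropy of an 8-bit image lies between 0 and 8 bits, and a very dark image whose pixels crowd into a few intensity levels scores low.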

Fig. 6 illustrates that our method and the SID method provide
more significant improvements in noise removal than the classic BM3D
method and can adaptively denoise images with diverse noise levels.
Furthermore, our results benefit from the white balance index
introduced in IEV and avoid abnormal colors in extreme low-light
images.

However, CNN-based methods tend to blur objects with
sharp edges (e.g. text). By introducing SSIM into the loss function, our
method shows more promising results on images with text and has
an advantage over SID in preserving edge information, e.g. the
images in the first row of Fig. 6. On the other hand, the controlled
experiments shown in Fig. 7 reveal that the weight $\alpha$ of the SSIM
component in the loss function $L_{ES}$ has to be limited to prevent
ESN from interpreting noise as texture and edges.


5.4 Low-light Video Processing 

The advantage of our work lies in eliminating the brightness
control ratio while achieving state-of-the-art low-light noise suppression.
Without any handcrafted parameters as network input, we can
directly apply our method to raw low-light video processing.

A raw-format video can occupy a very large amount of storage
space. Limited by our devices, we only developed a relatively simple
way to verify our method on raw low-light videos. In Fig. 10, we
experiment on 21 consecutive raw images captured in a burst session,
shot with a short exposure time and identical ISO, to discuss the
potential applications in raw video processing. In this scene, the camera
starts from a bridge with decorative lighting, then turns to the dark
riverbank, and finally to the dimly illuminated sidewalk. As shown
in Fig. 10, although each frame is enhanced completely
independently of the other frames, neither the estimated $t_1$ nor the
resulting images show any sudden fluctuation. For practical
purposes, however, it is also convenient to implement guideline time
filters between ESN and BPN, which allow the value of $t_1$ to be
dynamically filtered or amplified according to different demands.
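
As one possible form of such a guideline time filter, the sketch below smooths the per-frame $t_1$ estimates with an exponential moving average and applies an optional gain before they are passed on to ESN. The choice of filter, the function name and the `alpha` and `gain` parameters are assumptions for illustration; no specific filter is prescribed here.

```python
def smooth_guideline_times(t1_values, alpha=0.3, gain=1.0):
    """Exponential moving average over per-frame t1 estimates.

    alpha: weight of the newest estimate (1.0 disables smoothing).
    gain:  global multiplier to brighten (>1) or darken (<1) the video.
    """
    smoothed = []
    state = None
    for t1 in t1_values:
        # The first frame initializes the filter state.
        state = t1 if state is None else alpha * t1 + (1.0 - alpha) * state
        smoothed.append(gain * state)
    return smoothed

if __name__ == "__main__":
    # A sudden jump in the raw estimates is spread over several frames.
    print(smooth_guideline_times([1.0, 1.0, 2.0, 2.0], alpha=0.5))
```

A smaller `alpha` suppresses frame-to-frame flicker more strongly, at the cost of a slower response when the scene illumination actually changes.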


5.5 Computational Cost 

The proposed algorithm is evaluated on the ModelArts cloud service,
which is equipped with 8 vCPUs and an Nvidia P100 GPU
(16 GB GPU memory). The execution time for a high-resolution
image (3968 × 2976) is 0.27 seconds on average, which is only
a minor increase compared with the SID algorithm (0.21 seconds).
More specifically, the execution times of the BPN and ESN sub-models
are 0.05 and 0.22 seconds, respectively. For comparison, BM3D
(GPU implementation [50]) takes 4.32 seconds to complete the denoising
process. Therefore, other low-light methods that depend on BM3D
for denoising, including MSR, LIME, R-Net and others, are much less
efficient on devices with GPUs.


6 Conclusion and Discussion 


6.1 Conclusion 

In this paper, an adaptive raw image enhancement model is proposed
for extreme low-light image processing. The two-stage framework
provides adaptability to images with different scenes and exposure
parameters. We also present the CID dataset for model training
and evaluation. The experiments show that our model
provides state-of-the-art low-light image denoising as well as
adaptive global brightness control.


The proposed method has several advantages over existing
approaches. First, no external parameters are needed
after model training: a single raw image and its Exif metadata
(e.g. white balance, ISO and exposure time) are sufficient to complete
the process. Therefore, our model is able to process a large
batch of raw images without parameter handcrafting. Second, our
model significantly outperforms methods in which denoising
works as a separate step. Instead of simply applying
noise-suppression algorithms (e.g. BM3D) before or after the main
low-light enhancement procedure, we make denoising an
integrated part of our model, allowing the ESN module to learn
exposure shifting and denoising simultaneously in an end-to-end way.
In quantitative experiments, we adopted no-reference image quality
metrics including SNLE, NLE, NV and image entropy to test the
denoising performance. Our model has the lowest NLE score
compared with BM3D-based low-light methods and
obtains near-optimal scores on the other three indices. We
also reveal the potential application in low-light video processing.
The BPN module makes it possible for our model to process images
with diverse exposure levels, from extreme low-light images to mild
ones. To be applied in video processing, however, the model faces
further challenges, e.g. the brightness of frames should change
continuously and avoid fluctuation. In our experiment on a continuous
raw image series, although no information from adjacent frames is used,
the output of BPN changes continuously with the illumination condition
and shows no fluctuation.


6.2 Limitations and Future Works 


The proposed model is limited by its dataset, which is a common
problem in data-driven methods. First, the model is trained on
raw-format, multi-exposed images, which significantly improves
image quality. However, all reference ground-truth
images are captured in real scenes, and as a result they sometimes
have minor defects such as noise, halos around light sources, local
over-exposure and white balance deviation. The denoising performance
may be further improved if those drawbacks can be avoided.
Second, the image groups in CID do not cover enough natural
scenes. For example, green grass and trees tend to
have low color saturation in the resulting images, because
most images in the training set were collected in winter, when there
are hardly any green plants in the scenes. In future work, the raw-format
CID dataset can be extended by applying the unprocessing [14] technique
to generate raw-format images from other available online datasets.

Another limitation is the memory consumption of the algorithm. The
batch size is constrained because training the proposed model takes
a large amount of GPU memory. Training can therefore be
time-consuming on GPUs with small memory. We plan
to improve the network architecture and investigate the possibility of
deployment on mobile devices.

In future work, the model can be further extended by
introducing High Dynamic Range Imaging (HDRI). Since our CID
dataset is composed of multi-exposed images, we can compute an
HDR image of each scene. These HDR images can then be used as
ground truth in the ESN training.